
    A High Performance Parallel Classifier for Large-Scale Arabic Text

    Text classification has become one of the most important techniques in text mining. It is the process of classifying documents into predefined categories or classes based on their content. A number of machine learning algorithms have been introduced to deal with automatic text classification. One of the common classification algorithms is the k-Nearest Neighbor (k-NN), which is known to be one of the best classifiers applied to different languages, including Arabic, and it is included in numerous experiments as a basis for comparison. Furthermore, it is a simple classification algorithm and very easy to implement, since it does not require the training phase that most classification algorithms must have. However, the k-NN algorithm is computationally inefficient because it requires a large amount of computational power to evaluate a similarity measure between a test document and every training document and to sort the resulting similarities. This drawback makes it unsuitable for handling a large volume of high-dimensional text documents, particularly in the Arabic language. In our research, we propose to develop a parallel classifier for large-scale Arabic text that achieves an enhanced level of speedup, scalability, and accuracy. The proposed parallel classifier is based on the sequential k-NN algorithm. We test the parallel classifier using the Open Source Arabic Corpus (OSAC), which is the largest freely available public corpus of Arabic text documents. We study the performance of the parallel classifier on a multicomputer cluster that consists of 14 computers. We report both timing and classification results. These results indicate that the proposed parallel classifier has very good speedup and scalability and is capable of handling large document collections. Classification results also show that the proposed classifier achieves accuracy, precision, recall, and F-measure values higher than 95%.
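    The abstract describes the core computational pattern: score a test document against every training document, sort the similarities, and take a majority vote among the nearest neighbors, with the training set partitioned across nodes to parallelize the work. The sketch below illustrates that idea only; it is not the authors' implementation. The use of TF-IDF vectors with cosine similarity, the `multiprocessing`-based partitioning (standing in for the 14-node cluster), and all function names and parameters are assumptions made for illustration.

    ```python
    # Minimal data-parallel k-NN text classification sketch (illustrative only).
    from collections import Counter
    from multiprocessing import Pool

    import numpy as np
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    K = 3          # number of nearest neighbors (hypothetical choice)
    N_WORKERS = 4  # stand-in for the cluster nodes described in the paper

    def local_top_k(args):
        """Each worker scores one partition of the training set and returns
        its local top-k (similarity, label) pairs."""
        test_vec, train_part, labels_part = args
        sims = cosine_similarity(test_vec, train_part).ravel()
        top = np.argsort(sims)[::-1][:K]
        return [(sims[i], labels_part[i]) for i in top]

    def parallel_knn_classify(test_doc, train_docs, train_labels):
        # Vectorize the corpus once; workers only receive their row partition.
        vectorizer = TfidfVectorizer()
        X_train = vectorizer.fit_transform(train_docs)
        x_test = vectorizer.transform([test_doc])

        # Partition the training matrix row-wise across workers.
        idx_parts = np.array_split(np.arange(X_train.shape[0]), N_WORKERS)
        tasks = [(x_test, X_train[idx], [train_labels[i] for i in idx])
                 for idx in idx_parts]

        with Pool(N_WORKERS) as pool:
            partial_results = pool.map(local_top_k, tasks)

        # Merge the local results, keep the global top-k, then majority vote.
        merged = sorted((p for part in partial_results for p in part),
                        reverse=True)[:K]
        votes = Counter(label for _, label in merged)
        return votes.most_common(1)[0][0]

    if __name__ == "__main__":
        docs = ["الاقتصاد والمال", "كرة القدم والرياضة", "البورصة والأسهم",
                "مباراة كرة السلة", "أسعار النفط", "بطولة كرة القدم"]
        labels = ["economy", "sport", "economy", "sport", "economy", "sport"]
        print(parallel_knn_classify("نتائج مباراة اليوم", docs, labels))
    ```

    The key design point, which carries over to a cluster setting, is that each node only needs its own slice of the training data plus the test vector: the expensive similarity computation is embarrassingly parallel, and only the small per-node top-k lists are merged at the end.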